Abstract - Identifying communities has always been a fundamental
task in the analysis of complex networks. Many methods have
been devised over the last decade for the detection of communities.
Amongst them, the label propagation algorithm combines great
scalability with high accuracy. However, it has one major
flaw: when the community structure in the network is not clear
enough, it assigns every node the same label, thus detecting
the whole graph as one giant community. We address this
issue by setting a capacity for communities, starting from a small
value and gradually increasing it over time. Preliminary results
show that our extension not only improves the detection capability
of the classic label propagation algorithm when communities are not
clearly detectable, but also improves the overall quality of the
identified clusters in complex networks with a clear community
structure.


I. Introduction 

Complex networks appear in a wide variety of domains. As
a result, studying the structure of such networks has attracted a
tremendous amount of attention over the years. Real-world
networks are usually composed of communities - informally
described as groups of nodes with dense connections
among themselves and loose connections to the rest of
the network. As the building blocks of complex networks,
communities reveal invaluable information about key features
of the network. Retrieving the community structure can help
us find the functional modules in biological networks, find 
groups of cohesive data cubes in large-scale databases, find 
groups of users in online social networks with similar attributes 
and interests, thus enabling us to develop effective marketing 
strategies in such networks, predict future interactions between 
users or study the emergence and popularity of ideas in social 
media |[Tl]||3] ll^. Eor a great survey refer to 0]. 

There is no single definition of communities.
Radicchi et al. define communities in two senses. Let
node $v$ be in community $C$ with degree $d_v$; then the number
of links between $v$ and other nodes within $C$ is $d_v^{in}(C)$, and
the number of links between $v$ and the rest of the network
is $d_v^{out}(C)$. A subgraph $U$ of a graph $G$ is a community in
a strong sense if:

$$\forall v \in U : \quad d_v^{in}(U) > d_v^{out}(U) \tag{1}$$

Similarly, a subgraph $V$ of graph $G$ is a community in a
weak sense if:

$$\sum_{v \in V} d_v^{in}(V) > \sum_{v \in V} d_v^{out}(V) \tag{2}$$
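
The two definitions can be checked mechanically. The following is a minimal sketch, assuming an undirected graph stored as an adjacency dictionary (node -> list of neighbors); the function names and the data layout are illustrative assumptions, not part of the original work.

```python
def in_out_degrees(adj, community):
    """Return {v: (d_in, d_out)} for every node v in `community`."""
    members = set(community)
    result = {}
    for v in members:
        d_in = sum(1 for u in adj[v] if u in members)
        result[v] = (d_in, len(adj[v]) - d_in)
    return result


def is_strong_community(adj, community):
    """Strong sense: every member has more links inside than outside."""
    degrees = in_out_degrees(adj, community)
    return all(d_in > d_out for d_in, d_out in degrees.values())


def is_weak_community(adj, community):
    """Weak sense: total internal degree exceeds total external degree."""
    degrees = list(in_out_degrees(adj, community).values())
    return sum(d for d, _ in degrees) > sum(d for _, d in degrees)


# Two triangles joined by the single edge 2-3 (toy example).
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
         3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(is_strong_community(graph, [0, 1, 2]))
print(is_weak_community(graph, [0, 1, 2, 3]))
```

In this toy graph a triangle satisfies the strong definition, while the triangle plus the bridge endpoint satisfies only the weak one, since node 3 has more links outside the set than inside.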


There are various measures quantifying the quality of
clusters. One of the most popular measures is modularity.
Knowing that real-world networks possess strong community
structure compared to random networks, modularity measures
the difference between a given partitioning in a certain graph
and the same partitioning in a random graph with the same
degree distribution. The modularity of a partitioning into the set of
clusters $C$ can be written as:

$$Q = \sum_{C_i \in C} \left[ \frac{m_i}{M} - \left( \frac{d_i}{2M} \right)^2 \right] \tag{3}$$

where $m_i$ is the number of edges inside partition $C_i$, $d_i$ is the
total degree of nodes inside $C_i$, $C$ is the set of all clusters and
$M$ is the total number of edges.
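
As a small worked example of Eq. (3), the sketch below computes modularity directly from the quantities named in the text (edges inside each cluster, total degree of each cluster and total number of edges). The adjacency-dictionary input and the function name are assumptions made for illustration.

```python
def modularity(adj, labels):
    """Modularity Q of Eq. (3) for an undirected, unweighted graph.

    adj    : dict mapping node -> list of neighbors
    labels : dict mapping node -> cluster label
    """
    total_edges = sum(len(nbrs) for nbrs in adj.values()) / 2   # M
    inside = {}   # m_i: number of edges inside cluster i
    degree = {}   # d_i: total degree of the nodes in cluster i
    for v, nbrs in adj.items():
        c = labels[v]
        degree[c] = degree.get(c, 0) + len(nbrs)
        same = sum(1 for u in nbrs if labels[u] == c)
        inside[c] = inside.get(c, 0) + same / 2   # each internal edge is seen twice
    return sum(inside[c] / total_edges - (degree[c] / (2 * total_edges)) ** 2
               for c in degree)


# Two triangles joined by one edge, split into the two triangles.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
         3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(graph, split))
```

For this graph $M = 7$, each triangle has $m_i = 3$ and $d_i = 7$, so $Q = 2\,(3/7 - 1/4) = 5/14$. Placing all nodes in one cluster gives $Q = 0$.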


The growing need for retrieving communities from complex
networks, together with the rapid growth in the size of data,
highlights the importance of scalable and accurate methods
to detect communities in such networks. With modularity
as the objective function, graph partitioning can be seen as an
optimization problem. Many algorithms for community detection
have been proposed over the last decade with modularity
optimization as their ultimate goal, which has proved
in practice to retrieve communities of great quality.
There is a class of algorithms exploiting the power of linear
algebra to detect communities using the eigenvectors of the
Laplacian matrix. Another class of algorithms uses the
fact that clusters have weak connections between themselves;
knowing this, these algorithms find minimal cuts and
divide the graph recursively, leading to a dendrogram of
node-cluster membership. There are also novel approaches
that do not completely fit in the previous categories, which
are built around statistical and mechanical phenomena in the real
world.


All these methods try to optimize a global objective
function or use the whole structure of a network to divide
it into clusters. There is a problem with this kind of approach,
especially in social networks: individuals in such
networks do not join communities to increase a global
quality function; rather, they join them to improve their own
utility, be it more enjoyment through joining a group
of people with similar interests on Facebook, or following
a politician on Twitter in order to stay in touch with the
latest news in politics. Apart from that, it has been proved that
algorithms based solely on modularity optimization fail to
detect small communities as the size of the network
increases, which is known as the resolution limit of
modularity. All this evidence led us to choose a
node-centred approach. As an example of node-centred approaches,
some algorithms employ a game-theoretic approach to find
communities. Let nodes be agents in a game, each with a personal
utility function. A strategy in this game which leads to a
Nash or local equilibrium will yield a community structure
in the network. Another approach, with near-linear running time, is the
Label Propagation Algorithm (LPA). In this method, every
node is assigned a unique label in the initial condition of
the network. Afterwards, in each iteration, nodes are traversed
in a random order and every node acquires the label that
is most frequent among its neighbors. This method has one
serious problem, though: in some networks, one of the labels
will over-propagate and sweep through other labels, leading to
the detection of the whole network as one giant community, which
renders this method counterproductive in dense networks with
unclear community structure. This phenomenon is often called
flood-fill. The focus of this article is overcoming this flaw and
improving the accuracy of LPA when community structure is
hardly detectable.
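
The basic procedure described above can be sketched in a few lines. This is a minimal illustration of asynchronous label propagation, not the authors' implementation: the adjacency-dictionary input, the iteration cap `max_iter`, the `seed` parameter and the tie-breaking rule (keep the current label when it is among the most frequent ones, otherwise pick uniformly at random) are assumptions made for the example.

```python
import random
from collections import Counter


def label_propagation(adj, max_iter=100, seed=0):
    """Asynchronous LPA on an undirected graph given as node -> neighbor list."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}            # every node starts with a unique label
    nodes = list(adj)
    for _ in range(max_iter):
        rng.shuffle(nodes)                  # traverse nodes in random order
        changed = False
        for v in nodes:
            if not adj[v]:
                continue                    # an isolated node keeps its label
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            candidates = [l for l, c in counts.items() if c == best]
            if labels[v] in candidates:
                continue                    # current label is already maximal
            labels[v] = rng.choice(candidates)
            changed = True
        if not changed:                     # no node wants to switch: equilibrium
            break
    return labels


# Two disjoint cliques: labels can never cross between them.
graph = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2],
         4: [5, 6], 5: [4, 6], 6: [4, 5]}
print(label_propagation(graph))
```

Because a node only ever copies a label from a neighbor, a label cannot leave the connected component it started in. The flood-fill problem arises when a single label spreads across an entire connected but internally structured graph.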

The rest of the article is structured as follows. Section II 
reviews potentials and flaws of LPA along with subsequent ex¬ 
tensions to overcome its flaws. Section III presents the formal 
definition of Controlled Label Propagation Algorithm using 
gradual expansion of communities and discusses the rationale 
behind choosing this approach. In Section IV we empirically 
compare our method with LPA and two of its extensions along 
with several state-of-the-art algorithms in real-world networks 
in addition to standard network benchmarks. Lastly, section V 
discusses further research related to our work in addition to a 
conclusion. 

II. Background and Motivation

In this section we discuss the considerable potential of
LPA and why it is important to improve this method for further
use. We also cover some of the flaws of this algorithm
and previous efforts to deal with these flaws.

A. Where label propagation prevails? 

One of the biggest advantages of LPA is its time efficiency.
Raghavan et al. state that 95% of the nodes reach their final
state within 5 iterations. Although experiments by various
authors suggest that the number of iterations needed to reach
an equilibrium state, $I$, grows very slowly with the size of the
network, this growth is not yet fully understood. Leung et al. have
created a class of networks on which $I$ increases logarithmically
with the size of the network. Although this class of
networks is highly unlikely to appear in natural graphs, it
shows that efforts to find an upper bound for $I$ have gone
no further than $\log N$ in a network of size $N$. Since the time
complexity of LPA is $O(MI)$ in a graph with $M$ edges, in the
worst case its time complexity will be $O(M \log N)$, which is
still considerably fast compared to other methods.

Apart from its time-efficiency, LPA can be trivially ex¬ 
tended to directed and weighted graphs with negative links. 
It also does not need to know the number of communities a 
priori, so in domains with zero knowledge about the network, 
it can be quite useful. 


Leung et al. believe that since every node only requires
information from its neighbors, LPA can easily be run in
a distributed environment in almost constant time. This
is particularly important with the emergence of ubiquitous
computing and mobile social networks. All the computation
is distributed among the nodes and, since every node has
very few neighbors compared to the size of the graph, the
community structure of large-scale networks can be revealed
in little time. The authors also suggest that since every node
updates its state from the current state of its neighbors, LPA
can be used to detect communities in dynamic environments
where nodes and links may come and go. Furthermore,
there is growing concern regarding the privacy of
information in social networks. Keep in mind that every node
only receives information from nodes it is already in contact
with. In addition, the topology of the whole graph is never
revealed to the nodes. As a result, LPA also guarantees an
acceptable level of privacy when run on social networks.

Moreover, the approach adopted by LPA is
completely intuitive with regard to real-world communities.
Consider the following scenario: there are $N$ people invited to a
ceremony. Assuming friendship to be a zero-one relationship,
in which zero means no friendship and one means friendship,
their friendship status forms an undirected and unweighted
graph. After the ceremony starts, people will try to form
circles with their friends in order to be near them and enjoy
their company. If we rearrange the circles according to the
communities retrieved from their friendship graph using LPA,
the new circles will remain the same throughout the ceremony,
since no one can join a better circle. In other words,
no one is dissatisfied with her community and these communities
provide an equilibrium state in real life. Formally, we
call a node $v$ in community $C_i$ dissatisfied if there exists
another community $C_j$ such that $d_v^{C_j} > d_v^{C_i}$, where
$d_v^{C}$ is the number of links from $v$ to nodes within $C$. As an
inherent characteristic of LPA, there are no dissatisfied nodes
in the retrieved communities. However, this does not hold for
many other algorithms. We have tested several algorithms,
namely the multilevel modularity optimization of Blondel et al.,
the greedy modularity optimization of Clauset et al. and the random
walk community detection of Rosvall et al., to find the
percentage of dissatisfied nodes in a network given the
community structure produced by these algorithms [results are omitted].
Although our experiments did not show a considerable portion
of nodes being dissatisfied (on average, less than 5% in five
collaboration networks), keep in mind that
even this small portion of nodes can be considered false positives
in domains where it is crucial that nodes belong to the optimal
community from their own perspective, as in social networks.
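
The notion of a dissatisfied node translates directly into code. The sketch below lists the nodes that have strictly more links into some other community than into their own; the adjacency-dictionary input and the function name are assumptions for illustration.

```python
from collections import Counter


def dissatisfied_nodes(adj, labels):
    """Nodes with strictly more neighbors in another community than in their own."""
    result = []
    for v, nbrs in adj.items():
        counts = Counter(labels[u] for u in nbrs)
        own = counts.get(labels[v], 0)              # links into v's own community
        if any(c > own for l, c in counts.items() if l != labels[v]):
            result.append(v)
    return result


# Two triangles joined by the edge 2-3.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
         3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
good = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
bad = dict(good)
bad[0] = "B"                                        # node 0 placed in the wrong triangle
print(dissatisfied_nodes(graph, good))
print(dissatisfied_nodes(graph, bad))
```

With the natural split no node is dissatisfied. Moving node 0 into the other triangle makes both node 0 and node 2 dissatisfied, since node 2 then has two neighbors labelled "B" and only one labelled "A".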

B. Where label propagation fails? 

As we mentioned before, on certain graphs LPA fails to detect
the community structure of the graph and reports the whole
graph as one community. This is apparently because the speed
of formation of different communities in the network varies
significantly. In other words, the cores of stronger communities
are formed in the early stages, while weaker communities have
not yet reached consensus. Furthermore, when a community
has not reached its final state, it is composed of several smaller
pieces with different labels, possessing a weak core which
is vulnerable to the propagation of foreign labels. In a nutshell,
stronger communities, exploiting the lack of unity in weaker
communities, often sweep through the cores of small pieces
of weaker communities and attract all of their members. This
usually leads to one giant community and several smaller
ones. In rare cases, the situation is exacerbated when there is
one label strong enough to overwhelm all other labels, leaving
every node with the same label in the end.

Leung et al. have addressed this issue by employing a
technique called Hop Attenuation. This technique is based
on the observation that the diameter of communities ought
to be small relative to that of the whole graph. With
hop attenuation, labels lose their strength as they
propagate, meaning that as a label travels away from its center,
its strength is exhausted and it ultimately vanishes.
This can be formally described as:

$$s_v(\ell) = \max \{ s_u(\ell) : u \in N(v) \} - \delta \tag{4}$$

where $s_v(\ell)$ is the strength of label $\ell$ when propagated
by node $v$, $N(v)$ is the neighborhood set of node $v$ and $\delta$ is
a parameter used to decrease the strength of labels each time they
traverse a link. The authors realized that choosing a
constant $\delta$ irrespective of the network may lead to a substantial
decrease in the quality of communities. Knowing this, they
proposed methods to change $\delta$ adaptively, based either on
the current number of iterations or on a value known a
priori and given to the algorithm as a parameter.
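
To illustrate the attenuation rule of Eq. (4), the following sketch computes the strength with which a single label reaches each node when it starts at one origin node. The initial strength of 1.0, the cut-off at non-positive strength and the function name are assumptions made for this example; it follows one label in isolation and ignores competition between labels.

```python
def attenuated_scores(adj, origin, delta):
    """Strength of the origin's label at each node it can still reach.

    The label starts with strength 1.0 (assumed) and loses `delta`
    on every hop; it stops spreading once its strength is not positive.
    """
    scores = {origin: 1.0}
    frontier = [origin]
    while frontier:
        next_frontier = []
        for v in frontier:
            s = scores[v] - delta            # Eq. (4): best neighbor strength minus delta
            if s <= 0:
                continue                     # label is exhausted at this distance
            for u in adj[v]:
                if u not in scores:          # breadth-first, so the first visit is the maximum
                    scores[u] = s
                    next_frontier.append(u)
        frontier = next_frontier
    return scores


# A path 0-1-2-3-4-5: with delta = 0.25 the label dies out after three hops.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(attenuated_scores(path, origin=0, delta=0.25))
```

A larger $\delta$ confines each label to a smaller neighborhood, which is why a fixed value can hurt quality when community diameters vary.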

Leung et al. used another powerful heuristic, called Node Preference, to avoid
imbalanced propagation of labels.
They realized that blindly valuing every neighbor of a node
equally might be a reason for the over-propagation
phenomenon. Given that finding the maximal label in LPA
simply means finding the label with the maximum number of occurrences among
the neighbors, Leung et al. changed the formulation to the following:

$$\ell_v^{i+1} = \arg\max_{\ell} \sum_{u \in N(v),\ \ell_u^i = \ell} s_u(\ell_u^i) \, f(u)^m \tag{5}$$

where $\ell_v^i$ is the label assigned to node $v$ in iteration $i$,
$s_u(\ell_u^i)$ is the strength of $u$'s label in iteration $i$, $N(v)$ is the
neighborhood set of $v$, the sum runs over the neighbors of $v$ that hold
label $\ell$, the exponent $m$ weights the preference and $f$ is any
comparable function on nodes, such as betweenness centrality, degree
centrality or any other measure.
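
The update rule of Eq. (5) can be sketched as a weighted vote. In the example below, the function name, the argument layout and the choice of node degree as the preference function $f$ are assumptions for illustration; with $m = 0$ and unit strengths the rule reduces to the plain majority vote of LPA.

```python
from collections import defaultdict


def preferred_label(v, adj, labels, strength, f, m=1.0):
    """Label chosen by node v under node preference (weighted vote of neighbors)."""
    totals = defaultdict(float)
    for u in adj[v]:
        # each neighbor votes for its own label with weight s_u * f(u)^m
        totals[labels[u]] += strength[u] * f(u) ** m
    return max(totals, key=totals.get)


# Node 0 has two low-degree neighbors labelled "a" and one hub labelled "b".
graph = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0, 4, 5, 6, 7],
         4: [3], 5: [3], 6: [3], 7: [3]}
labels = {1: "a", 2: "a", 3: "b"}
strength = {u: 1.0 for u in graph}
degree = lambda u: len(graph[u])

print(preferred_label(0, graph, labels, strength, degree, m=0))  # plain majority
print(preferred_label(0, graph, labels, strength, degree, m=1))  # degree-weighted
```

With $m = 0$ the two "a" votes win; with $m = 1$ the hub's degree of 5 outweighs them and node 0 adopts "b".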

Šubelj and Bajec have taken advantage of both node preference
and hop attenuation and proposed two new strategies
called defensive preservation and offensive expansion of
communities. In defensive preservation, preference is
given to nodes in the core of communities. In offensive
expansion, on the other hand, preference is given to nodes on the
border of communities. The core and the border are identified
using a simple random walk algorithm. They have devised
two algorithms, namely DDALPA and ODALPA, for the defensive
and offensive strategies respectively. The authors show that
DDALPA results in high recall, whereas ODALPA gives high
precision. Furthermore, in an attempt to combine the
merits of DDALPA and ODALPA, they have created a
method called BDPA, which simply runs them one after another
to find communities of good quality in small networks.
Finally, exploiting the core-periphery structure of large-scale
networks, they developed a hierarchical algorithm, called
DPA, that has the efficiency of BDPA in addition to recursively
finding communities in the giant core of the graph. The
details of these algorithms are beyond the scope of this article;
it suffices to say that the results of both algorithms, especially
DPA, on large-scale graphs are comparable to those of state-of-the-art
algorithms in terms of modularity.

Barber and Clark took a different path. They realized
that over-propagation stems from the objective function of
LPA. Let $M$ be a matrix of $N$ rows and $C$ columns ($C$
is the number of communities), where $M_{ij}$ is one if node $i$ has
label $j$, and zero otherwise. Each row in this matrix is the
membership vector of a node. When a node is updated, it adopts
the most frequent label among its neighbors, thus increasing
the sum of dot products between its own membership vector
and those of its neighbors. Formally, when we update node $v$,
we increase:

$$\sum_{u \in N(v)} \sum_{i=1}^{C} M_{vi} M_{ui} \tag{6}$$

Applying the adjacency matrix, $A$, to (6) yields:

$$\sum_{i=1}^{N} \sum_{j=1}^{C} M_{vj} M_{ij} A_{vi} \tag{7}$$

Finally, since LPA does this for every node in an iteration,
the overall objective function, $H$, of LPA becomes:

$$H = \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=1}^{C} M_{ik} M_{jk} A_{ij} \tag{8}$$

With a little thought, it becomes clear that LPA is
simply increasing the number of edges connecting two nodes
within the same community. Thus, the objective reaches its
global maximum when every node is assigned the same label.
To avoid this undesirable result, Barber and Clark changed the
objective function of LPA to:

$$H' = H - \lambda G \tag{9}$$

where $H$ is the previously mentioned objective function
of LPA and $G$ is a penalty function, weighted by a coefficient
$\lambda$, that diverts the algorithm from its previous undesired
global maximum. Barber and Clark have proposed several $G$ functions,
two of which turn the objective into the modularity of unipartite and
bipartite networks, thus enabling LPA to locally maximize
modularity. This is a very impressive approach since it keeps
the potential benefits of LPA (parallelism, privacy preservation,
etc.) while overcoming its flaws and retrieving communities of
excellent quality.
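
The claim that the single-label assignment is the global maximum of $H$ is easy to verify numerically. Since $\sum_k M_{ik} M_{jk}$ is 1 exactly when nodes $i$ and $j$ share a label, Eq. (8) counts the ordered adjacent pairs with equal labels. The sketch below uses this simplification; the function name and input format are assumptions for illustration.

```python
def lpa_objective(adj, labels):
    """H of Eq. (8): ordered adjacent pairs (i, j) whose labels coincide.

    Every intra-community edge is counted twice, once per direction.
    """
    return sum(1 for i in adj for j in adj[i] if labels[i] == labels[j])


# Two triangles joined by the edge 2-3 (7 edges in total).
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
         3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
two_communities = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_community = {v: "a" for v in graph}

print(lpa_objective(graph, two_communities))   # 12: six internal edges, counted twice
print(lpa_objective(graph, one_community))     # 14: all seven edges, counted twice
```

The natural two-community split scores lower than the trivial partition, which is why an unpenalized LPA has no reason to resist a flood-fill.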

We now look at the flood-fill phenomenon from a different
angle. LPA has two modes, asynchronous and synchronous.
They both provide similar results, but the synchronous
mode gives better insight into the underlying
reasons for flood-fills. At the first iteration of LPA, every node
is assigned a unique label, thus every label in the neighborhood
of a node $v$ has the same frequency. As a result, the probability
of a node $v$ choosing a label $\ell$ in its neighborhood is $1/d_v$, where
$d_v$ is the degree of node $v$. We can easily compute the expected
number of nodes holding a label after the first iteration as:

$$E[N(\ell_u)] = \sum_{v \in N(u)} \frac{1}{d_v} \tag{10}$$

where $\ell_u$ is the label first assigned to node $u$. We call
the expected number of nodes choosing label $\ell_v$ after the first
iteration the attraction power of node $v$ and denote it by $A(v)$.

We have calculated $A$ for each label after the first iteration
of LPA on five co-authorship networks, as examples
of sparse networks with clear community structure, and a
friendship network from Blog-Catalog, as an example of a dense
social network whose community structure LPA fails to retrieve.
We have plotted the results, with nodes sorted
in descending order of $A$, in Fig. 1. As the plot suggests, the decrease
in the Blog-Catalog network is far more rapid than in the five
co-authorship networks. The distribution of $A$ is important for two
reasons. First, it enables us to anticipate the failure of LPA in
networks before the algorithm starts. Second, Fig. 1 shows that
imbalanced growth of communities starts from the very first
iteration, meaning that any strategy to control over-propagation
of labels based on the results of previous iterations may
fail. This highlights the importance of monitoring the growth of
communities from the very first iteration. In the next section we
propose a simple yet effective way to prevent over-propagation
starting from the first iteration.

Fig. 1: (Color online) We have calculated the attraction power
($A$) of labels in the six networks, which is an estimate of
the number of nodes holding each label after the first iteration
of LPA. As the plot suggests, attraction power is much more
evenly spread across nodes in the five co-authorship graphs
than in the Blog-Catalog network. The results show that
it might be too late to prevent a label from flooding the
whole graph even after the first iteration. They also show that
we can estimate how likely LPA is to fail based solely on
the structure of the graph.


III. Controlled Label Propagation

Before we go on to discuss preventing over-propagation,
there is another problem with LPA that needs to
be dealt with. In LPA, when there are other labels in the
neighborhood with the same frequency as a node's current label, the
existing label does not change. This means that if a bad decision
is made in the early stages, there is no mechanism
to recover from it. In terms of the objective function of LPA, there
is no mechanism to escape from an early local maximum and
we settle for the first local maximum we reach.
Barber and Clark have proposed a technique that
randomly walks on the local maximum to escape from it.
This might be dangerous as it may randomly change the
labels of several border nodes, which decreases the defensive
capabilities of a community and may lead to an avalanche
effect in which nodes' labels are changed one after another, leading
to a drastic change in the community structure after some
iterations. To prevent this, we have adopted a strategy inspired
by Simulated Annealing in which nodes tend to exchange a
good label - a label with maximal frequency among the node's
neighbors - for another good label in the early stages. However,
they lose this tendency over time, meaning that they will hold
their good label if there are no better labels in the final stages,
since we do not want established cores to disappear. This can
be done with a decreasing probability function $p(t)$, starting from
1 and ending at 0. With probability $p(t)$, we randomly choose a label
among the good labels; otherwise, we hold the current label if
there are no better labels.

Now we can devise a method to prevent all nodes from
ending up with the same label. We previously mentioned that
the main reason behind flood-fills is the rapid formation of the
cores of some communities and the sluggish growth of others. So
the key to this problem is ensuring "fair" growth of communities
in each iteration. To ensure that weaker communities have
a chance to grow, we set a capacity for all communities as a
function $C(t)$, where $t$ is the current iteration. When a label's
population reaches the capacity, it can no longer attract new
nodes. We define $C$ as follows:

$$C(t) = \left\lceil \frac{kt}{T} \right\rceil \frac{N}{k} \tag{11}$$

where $k$ is the number of times we increase the capacity
of communities (the maximum number of nodes a label can
be assigned to), $t$ is the current iteration number, $T$ is the
maximum number of iterations and $N$ is the size of the network.
It is clear that this function increases the capacity every $T/k$
iterations, starting from $N/k$ and ending at $N$. As a result,
in the final stages there is effectively no constraint on the size
of communities. Furthermore, we call the iterations between
two increases in capacity a cycle, so $k$ is the number of
cycles.
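
The capacity schedule can be sketched as a step function of the iteration number. The sketch below assumes a 1-based iteration counter and uses integer arithmetic for the ceiling. The text only requires the probability $p(t)$ to decrease from 1 to 0, so the linear form of `swap_probability` is a hypothetical choice, and both function names are illustrative.

```python
def capacity(t, T, k, N):
    """Capacity C(t): ceil(k * t / T) * N / k, for iterations t = 1, ..., T."""
    cycle = (k * t + T - 1) // T        # integer ceiling of k * t / T
    return cycle * N / k


def swap_probability(t, T):
    """Hypothetical linear schedule for p(t): 1 at t = 1, 0 at t = T (needs T > 1)."""
    return (T - t) / (T - 1)


# N = 1000 nodes, T = 100 iterations, k = 50 cycles: the capacity grows by
# N / k = 20 nodes every T / k = 2 iterations and reaches N at the end.
for t in (1, 2, 3, 100):
    print(t, capacity(t, T=100, k=50, N=1000), swap_probability(t, T=100))
```

A larger $k$ gives smaller and more frequent capacity increases, so communities are held back more tightly in the early iterations.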

The rationale behind our strategy of stopping popular labels
from attracting new nodes is to give weaker
communities a chance to form a core. In this way, a node $v$ might choose
a popular label $\ell$ when it is not full. During the rest of the
cycle, a number of nodes in $v$'s neighborhood find $\ell$ full and
join a sub-optimal label $\ell'$. With enough nodes joining $\ell'$, $v$
might change its mind and join $\ell'$ during the later iterations
of the current cycle and form the core of a new community
along with a number of its neighbors. The cores might not
be of good quality in the early cycles, but when we reach a
new cycle and increase $C$, those nodes that are not content
with their current label will join their desired label before it
gets full again. However, the core nodes of a community that
have acquired a label will keep their label and attract new
nodes. In this way we help weaker communities to form
their cores while still giving nodes complete freedom in the
final cycles to choose their desired label. We call our method
the Controlled Label Propagation Algorithm (CLPA).

In the next section we present the results of our tests and show
that our method is more effective than previously proposed
extensions at preventing flood-fills. Furthermore, the overall quality
of clusters is also increased on the various data sets on which we tested
our method.


IV. Experiments


A. Sparse Networks

We have tested CLPA against LPA on a wide variety
of networks. Although increasing the quality of clusters was
not our primary intention, the results reveal that on certain
graphs, the modularity score of clusters yielded by CLPA is
on average much higher than that of LPA. The data sets used for this
part include five co-authorship networks, namely GRQ,
HepTh, HepPh, CondMat and Astro, retrieved from articles
on different topics in Physics. There are two location-based
online social networks, namely Gowalla and Brightkite,
and an e-mail network named Enron. There is also
a graph of a piece of Youtube's social network, along with
a Computer Science co-authorship network from DBLP and a
product co-purchasing network from Amazon. We ran our
algorithm with three different values of $k$ (50, 100 and 200), along with
LPA, on all the networks between fifty and two times, depending on
the size of the network. Table I contains the name of each network,
the number of nodes $N$, the number of edges $M$ and the
modularity score $Q$ achieved by the two algorithms. We have used
the igraph Python library to compute modularity for the retrieved
communities.

TABLE I: Modularity scores for eleven data sets, averaged over
between 50 and 2 realizations depending on the size of the network. We
also show the variance of $A$ as an indicator of the unfairness of
attraction power in the networks.

Name         N        M        var(A)   CLPA    LPA
GRQ          5.24K    14.48K   1.0      0.797   0.735
HepTh        9.88K    25.97K   1.1      0.671   0.627
HepPh        12.01K   0.12M    1.2      0.497   0.488
CondMat      12.01K   0.12M    1.3      0.633   0.578
Astro        18.77K   0.20M    1.1      0.450   0.323
Enron        36.69K   0.18M    80.8     0.473   0.338
Brightkite   58.23K   0.21M    6.9      0.623   0.557
Gowalla      0.20M    0.95M    102.4    0.618   0.503
DBLP         0.32M    1.05M    2.4      0.697   0.622
Amazon       0.33M    0.93M    2.1      0.786   0.709
Youtube      1.13M    2.99M    233.9    0.682   0.555

Table I reveals that letting weaker communities grow improves
the overall quality of detected communities. In section
II.B we conjectured that the failure of LPA stems from an unfair
distribution of attraction power ($A$). In order to measure the
unfairness of $A$, we have calculated the variance of this value
over all nodes in each of the networks and shown the results
in Table I. Note that out of the five networks which yielded the
highest percentage increase in modularity, four are also among
the top five networks with the highest variance in $A$. Also, the
five networks with the lowest increase in modularity are among
the bottom six networks with regard to their variance in $A$. The
only anomaly is Astro, which is the 9th network with respect to
var($A$), but yields the second highest increase in modularity.
This shows that the failure of LPA is not solely related to
unfair wiring of the links in the networks. This calls for an
in-depth analysis to broaden our understanding of the underlying
reasons for the shortcomings of LPA in different networks.


B. Dense Networks 


As an example of a social network with a dense structure
in which LPA fails completely to retrieve any community
structure, we use the Blog-Catalog social graph, which
we analyzed earlier in section II. In this network, the average
and maximum degrees (shown in Table II) are significantly higher
than in the networks we tested in section IV.A. The node
with the maximum degree is connected to 38.7% of the nodes.
Furthermore, there are 62 nodes connected to more than 10%
of the network. These symptoms are quite rare in
social network structures. However, this shows that there exist
instances of social networks, although rare, in which LPA fails
to retrieve any meaningful set of communities. This means that
preventing this phenomenon is a legitimate concern.
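
To make the effect of such hubs concrete, the sketch below computes the attraction power of Eq. (10) for every node, together with its variance, the quantity reported as var(A) in Table I. The adjacency-dictionary input, the function names and the use of the population variance are assumptions for illustration; the text does not state which variance estimator was used.

```python
def attraction_power(adj):
    """A(u): expected number of nodes adopting u's label after the first iteration."""
    return {u: sum(1.0 / len(adj[v]) for v in adj[u]) for u in adj}


def variance(values):
    """Population variance of a collection of numbers (assumed estimator)."""
    values = list(values)
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


# A star (one hub, four leaves) versus a ring of five nodes.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
ring = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}

print(attraction_power(star), variance(attraction_power(star).values()))
print(attraction_power(ring), variance(attraction_power(ring).values()))
```

In the star the hub has attraction power 4 while each leaf has 0.25, giving a variance of 2.25. In the ring every node has attraction power 1 and the variance is 0. The attraction powers always sum to the number of nodes without isolated neighbors, so a large variance means a few labels are expected to capture most of the graph in the first iteration.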


The network comes with a set of ground-truth communities,
which enables us to measure the similarity between the retrieved
communities and the true ones. One popular similarity
measure is Normalized Mutual Information (NMI). Due to
the overlapping nature of the ground-truth communities, we
have used the code provided by McDaid et al. to calculate
NMI between our communities and the true ones. Note that
since the ground-truth communities overlap significantly, the NMI
score will be very low in general for any graph partitioning
method, but it is still useful for the sake of comparison.
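
For readers unfamiliar with the measure, the sketch below computes the standard NMI between two non-overlapping partitions, normalized by the mean of the two entropies. This is not the overlapping variant of McDaid et al. used in the experiments; it is shown only to convey what the score measures. The function name and the convention of returning 1.0 when both partitions consist of a single cluster are assumptions.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Standard NMI of two partitions given as equal-length label sequences."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    mutual = sum(c / n * math.log(c * n / (count_a[a] * count_b[b]))
                 for (a, b), c in joint.items())
    if h_a + h_b == 0:
        return 1.0                      # both partitions are one single cluster
    return 2 * mutual / (h_a + h_b)


truth = [0, 0, 1, 1]
print(nmi(truth, [5, 5, 7, 7]))   # same grouping under different label names
print(nmi(truth, [0, 1, 0, 1]))   # grouping independent of the truth
print(nmi(truth, [3, 3, 3, 3]))   # flood-fill: everything in one community
```

A flood-fill carries no information about the true grouping, so a one-community result scores zero against any non-trivial ground truth under this measure.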


We have compared our algorithm's performance to
state-of-the-art algorithms, namely the multi-level modularity
optimization algorithm of Blondel et al. (denoted
by Louvain), the greedy modularity optimization of
Clauset et al. (denoted by CNM) and the well-known
Infomap algorithm proposed by Rosvall and Bergstrom. We
have also tested the data set with two of LPA's most
successful extensions, namely BPA and DPA, proposed by
Šubelj and Bajec. The results, shown in Table II, show
that our algorithm not only prevents a flood-fill and captures a
community structure in this network, but also
yields relatively good results compared to other methods, except
for CNM, which is considerably slower than the others. We believe
that the relatively good resemblance between our communities
and the true ones originates from the nature of LPA,
which is completely intuitive with regard to how social networks
are formed in the real world. This means that if one combines a
strategy to ensure the balanced growth of communities with
LPA's basic method of community detection, the results can
be of great quality.




































TABLE II: We tested the Blog-Catalog network using the algorithms shown in the table and report the average NMI over
ten realizations. As the results reveal, LPA and both of its extensions, BPA and DPA, fail to capture any meaningful community
structure in this network. This highlights the importance of efficient ways to prevent the flood-fill phenomenon. Also worth
noticing is the failure of the Louvain algorithm to capture the essence of the ground-truth communities, resulting in a zero NMI
over ten realizations. Finally, note that CNM yields better results than ours but with a considerably higher execution time, around 50s
on a network of this size, which might cause trouble if one intends to use CNM on massive graphs.

N       M        d      d_max   CLPA     LPA   DPA   BPA   Infomap   CNM      Louvain
10.3K   333.9K   64.7   3992    0.0006   0.0   0.0   0.0   0.0003    0.0009   0.0


C. Networks with planted partitions 

In order to realize what parameters are responsible for the 
failure of LPA and get a better understanding of behavior 
of different algorithms in networks with a vague community 
structure, we have used lLancichinetti et al.l benchmark l^ . To 
challenge the detection power of these algorithms, we have set 
the size of the network (N) constant, then created networks of 
different densities {d) and community structure clarity (mixing 
parameter or p). In order to study the effects of fairness of 
degree distribution, we have also used networks with different 
maximum degree {dmax)- The results are depicted in Fig |2(a)[ 
Fig |2(b)| and Fig |2(c)[ As you can see, the algorithms based on 
propagation of data, including LPA and all of its extensions 
along with Infomap, follow a trend of giving high-quality 
results and then abruptly falling to zero. The exception is 
the DPA algorithm, for which the decrease in the quality of 
clusters starts sooner than for the others, but which keeps giving 
mediocre clusterings before completely failing to identify any 
useful structure in the networks. The pattern of behavior is 
different for the two modularity optimization algorithms, namely 
Louvain and CNM, as they tend to have a smoother fall. 
However, the smoothness is both good and bad. Good, because 
in networks with a vague community structure, they continue 
to yield communities with some degree of quality. Bad, because 
the decrease in the quality of their clusters starts sooner than 
it does for our algorithm. In a nutshell, even though CNM 
and Louvain keep on yielding non-zero NMI at later stages 
and hit the ground more slowly than we do, they give sub-optimal 
solutions where our algorithm yields optimal communities. 
All in all, our algorithm keeps giving near-perfect results, while 
the other algorithms suffer from some inaccuracy, rendering ours 
superior in graphs with unclear community structure. Keep in 
mind that we achieve better results at higher μ while 
giving perfect results at lower values. That is, we are 
not sacrificing accuracy at the start of the μ spectrum to attain 
higher accuracy at the end. Furthermore, even though other 
algorithms might experience a sporadic decrease in NMI even 
before μ = 0.5, our algorithm stays firmly on top of the chart 
and yields near-perfect results in any case. 
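
To make the mixing parameter concrete, the sketch below measures it on a labelled graph as the average, over nodes, of the fraction of a node's edges that leave the node's own community. The helper name `mixing_parameter` and the toy graph are illustrative assumptions and are not part of the benchmark generator.

```python
def mixing_parameter(edges, community):
    """Average fraction of each node's edges that leave its community.

    edges: iterable of undirected (u, v) pairs without self-loops.
    community: dict mapping every node to its ground-truth community.
    """
    degree = {}
    external = {}
    for u, v in edges:
        # Count the edge once from each endpoint's point of view.
        for a, b in ((u, v), (v, u)):
            degree[a] = degree.get(a, 0) + 1
            if community[a] != community[b]:
                external[a] = external.get(a, 0) + 1
    return sum(external.get(n, 0) / degree[n] for n in degree) / len(degree)


# Toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(mixing_parameter(edges, community))
```

On this toy graph only nodes 2 and 3 have an external edge (one out of three each), so the average is 1/9. Values above 0.5 mean that nodes have, on average, more neighbours outside their community than inside it.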

Looking at the landing points of the algorithms (the point at 
which NMI reaches zero) can give us great insight into the 
impact of network density on their detection power. One 
group of algorithms, including LPA, BPA, DPA and Infomap, 
loses detection power as the density of the network increases. 
In other words, the landing point of these algorithms moves 
to the left as the density grows. Interestingly, our algorithm 
and Louvain behave differently. As the density increases, we 
can detect communities in networks with a more ambiguous 
structure. The effect of density growth on Louvain is somewhat 
different, as only its curvature decreases: not only does it 
start to fall at later stages, but its quality also decreases 
less in the early stages of its fall. The CNM algorithm behaves 
the same regardless of the density of the network, as it seems to 
take into account only the clarity of the community structure in 
all three networks. 
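
The landing point can be read off an NMI curve mechanically. The following is a minimal sketch, assuming the curve is given as (μ, NMI) pairs; the function name `landing_point`, the tolerance and the sample values are made up for illustration and are not measurements from Fig. 2.

```python
def landing_point(curve, tol=1e-9):
    """Smallest mu from which the NMI stays at zero, or None if it never lands.

    curve: iterable of (mu, nmi) pairs in any order.
    """
    landing = None
    # Walk from the largest mu downwards while the score is still zero.
    for mu, score in sorted(curve, reverse=True):
        if score > tol:
            break
        landing = mu
    return landing


# Hypothetical curves for one algorithm on a sparser and a denser network.
sparse = [(0.1, 1.0), (0.5, 0.98), (0.6, 0.4), (0.7, 0.0), (0.8, 0.0)]
dense = [(0.1, 1.0), (0.5, 0.97), (0.6, 0.0), (0.7, 0.0), (0.8, 0.0)]
print(landing_point(sparse), landing_point(dense))
```

A landing point that moves to the left as the density grows, as in these made-up curves, corresponds to the behaviour described for LPA, BPA, DPA and Infomap.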

Now we take a look at each algorithm’s performance in the 
three networks individually and compare it to ours. BPA and 
LPA generally follow the same path. They usually yield 
near-perfect results in the beginning and are the first algorithms 
to fall to zero. Both of these algorithms yield results of 
worse quality than ours in all three networks. 
To be more specific, not only do they reach zero sooner than 
we do, but they also yield equal or lower NMI on the 
networks where they show reasonable performance. DPA’s 
behavior seems to be a little more complicated, as it shows a 
smooth fall in the first network (d = 20) and a sudden fall 
in the last one (d = 40). In the first network, the smooth fall 
of DPA enables it to surpass our algorithm at very high μ, after 
it goes below our algorithm for μ between 0.55 and 0.75. This 
does not hold for the other two networks, as DPA reaches zero 
before we even start to fall. The case is quite clear for the 
Infomap algorithm: on all three networks, our algorithm 
outperforms it both in point-to-point comparison and in landing 
point. As we mentioned earlier, the Louvain plot has a smooth 
fall. Due to this characteristic, we surpass Louvain at 
early stages, where we yield near-perfect results while 
Louvain suffers some decrease in the quality of its 
retrieved communities. But Louvain catches up with us in the 
end in all three networks. We will discuss the shortcoming 
of our algorithm at very high μ on all three networks 
further below. However, you should keep in mind that even 
though Louvain keeps on giving a community structure, on 
average, the resemblance between the retrieved communities 
and the true ones decreases very fast to around 0.2 in terms of 
NMI. This means that although Louvain does not completely 
miss the essence of the community structure present in a 
network, it also does not detect a big portion of it. The 
CNM algorithm behaves the same regardless of the density of 
the networks and achieves lower NMI than all the others. 

Finally, to find out the cause of our algorithm’s failure at 
very high values of μ, we calculated the number of dissatisfied 
nodes (recall section II.A, where we discussed the dissatisfaction 
of nodes in a given clustering) on all three 
networks for each μ. The results are shown in Fig. 2(d). Since 
there were no dissatisfied nodes when μ was below 0.5, we 


































(a) a class of networks with d = 20 and dmax = 100 


(b) a class of networks with d = 30 and dmax = 150 
(c) a class of networks with d = 40 and dmax = 200 


(d) Dissatisfaction rate in the three networks in this figure. 


Fig. 2: (Color online) We have tested several algorithms on the benchmarks of Lancichinetti et al., using different d and dmax. As 
the figures suggest, our algorithm surpasses the others by a big margin as the density (d) increases. In addition, in the bottom-right 
figure, we have shown that the number of dissatisfied nodes increases dramatically as μ grows, meaning that without significantly 
changing the core of LPA, it is extremely difficult to detect communities in environments with very high μ. 


did not plot the corresponding data. As you can see, when μ 
reaches the end of its spectrum, the number of unhappy nodes 
increases exponentially. As a result, the ground-truth clustering 
cannot be an equilibrium state for LPA or our algorithm. This 
means that without altering the essence of LPA, it will be 
extremely difficult to detect the planted partitions in networks 
at the end of the μ range. Besides, it is extremely rare to see a μ 
this high in real-world networks anyway. 
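
The equilibrium argument can be checked directly on a labelled graph. The sketch below counts dissatisfied nodes under an assumed definition: a node is dissatisfied when some other label is strictly more frequent among its neighbours than its own label. The precise definition is the one given in section II.A, so this reading, the function name `dissatisfied_nodes` and the toy graph are illustrative assumptions.

```python
from collections import Counter


def dissatisfied_nodes(adjacency, labels):
    """Nodes whose own label is not among the most frequent neighbour labels.

    adjacency: dict mapping each node to a list of its neighbours.
    labels: dict mapping each node to its community label.
    """
    unhappy = []
    for node, neighbours in adjacency.items():
        if not neighbours:
            continue  # an isolated node has nothing to disagree with
        counts = Counter(labels[v] for v in neighbours)
        if counts.get(labels[node], 0) < max(counts.values()):
            unhappy.append(node)
    return unhappy


# Two triangles joined by the edge (2, 3).
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
             3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
truth = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(dissatisfied_nodes(adjacency, truth))      # clear structure
print(dissatisfied_nodes(adjacency, {**truth, 4: "A"}))  # node 4 mislabelled
```

If the ground-truth labelling leaves any node dissatisfied, a label propagation step would move that node away from its true community, so the ground truth cannot be a fixed point of the update rule.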

V. Future Work & Conclusion 

We discussed how LPA has considerable potential to be 
employed in massive social networks: to name a few of its 
strengths, it can be utilized in distributed environments, it takes 
an intuitive approach to community detection, and it offers great 
time and memory efficiency. We then outlined its major flaw, known 
as the flood-fill phenomenon. We mentioned how previous works 
have chosen two different paths to address this issue: 

• Changing the objective function of LPA. 

• Controlling the speed of growth for all communities. 

We then proposed an algorithm, following the second path, that 
controls the over-propagation of popular labels by choosing a 
small capacity and gradually increasing it over time. We then 
evaluated our algorithm on several domains, namely sparse 
real-life networks, a dense real-life network and standard 
benchmarks with planted partitions. Our results showcased the 
strength of our algorithm in networks with unclear community 
structure, compared to existing robust algorithms. Since our 
method tackles the problem of flood-fills in a new way (as far 
as we know), it can be considered a framework on which others 
can build more efficient enhancements to prevent flood-fills. 
Further research can also be done on finding other ways to 
prevent flood-fills starting from the very first iteration. Another 
interesting area might be analyzing the properties of real-world 
networks in which LPA completely fails. Moreover, our method of 
preventing flood-fills can easily be incorporated into the majority 
of LPA extensions, such as BPA and DPA, since it does not change 
the essence of LPA by any means. 
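
The capacity idea summarized above can be sketched in a few lines. This is only one possible reading of it, not the exact procedure evaluated in this paper: the geometric growth schedule, the tie-breaking rule, and the names `capacity_lpa`, `start_capacity` and `growth` are all assumptions made for illustration. Node identifiers are assumed to be mutually comparable (for example integers), and the graph is assumed to have no self-loops.

```python
import random
from collections import Counter


def capacity_lpa(adjacency, start_capacity=2, growth=2, max_rounds=50, seed=0):
    """Label propagation in which no label may exceed the current capacity.

    adjacency: dict mapping each node to a list of its neighbours.
    The capacity starts small and is multiplied by `growth` after each round
    (assumed schedule), up to the number of nodes.
    """
    rng = random.Random(seed)
    labels = {v: v for v in adjacency}      # every node starts alone
    size = Counter(labels.values())         # current size of each label
    nodes = list(adjacency)
    n = len(nodes)
    capacity = start_capacity
    for _ in range(max_rounds):
        rng.shuffle(nodes)
        changed = False
        for v in nodes:
            counts = Counter(labels[u] for u in adjacency[v])
            # v may keep its own label or join a label that still has room.
            allowed = {l: c for l, c in counts.items()
                       if l == labels[v] or size[l] < capacity}
            if not allowed:
                continue
            best = max(allowed.values())
            if allowed.get(labels[v], 0) == best:
                continue                    # keep the current label on ties
            choice = rng.choice(sorted(l for l, c in allowed.items() if c == best))
            size[labels[v]] -= 1
            size[choice] += 1
            labels[v] = choice
            changed = True
        if not changed and capacity >= n:
            break                           # stable with no capacity limit left
        capacity = min(n, capacity * growth)
    return labels


graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(capacity_lpa(graph))
```

Because a node may only join a label that is still below the capacity, no label can absorb the whole graph in the early rounds, which is the intended guard against flood-fill. Setting `growth=1` keeps the capacity fixed and makes that bound easy to observe.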